Performance of a Multicore Matrix Multiplication Library
Abstract
Multicore processors promise dramatic improvements in performance, but their diverse and often unique architectures are a major inhibitor to software adoption. Algorithm libraries that operate at the chip level and are optimized across multiple cores provide the quickest route by which programmers can port or develop high-performance software for multicores. This paper reports on a flexible matrix multiplication library for the Cell Broadband Engine™ (BE) processor that meets or exceeds the performance of known matrix multiplication implementations on the Cell. In addition, the library operates within a larger framework for programming multicores that enables programmers to combine library code with multicore functions they have developed themselves.
Related resources
PERI - Auto-tuning memory-intensive kernels for multicore
Abstract. We present an auto-tuning approach to optimize application performance on emerging multicore architectures. The methodology extends the idea of search-based performance optimizations, popular in linear algebra and FFT libraries, to application-specific computational kernels. Our work applies this strategy to sparse matrix vector multiplication (SpMV), the explicit heat equation PDE on...
Subdivision Surface Evaluation as Sparse Matrix-Vector Multiplication
We present an interpretation of subdivision surface evaluation in the language of linear algebra. Specifically, the vector of surface points can be computed by left-multiplying the vector of control points by a sparse subdivision matrix. This “matrix-driven” interpretation applies to any level of subdivision, holds for many common subdivision schemes (including Catmull-Clark and Loop), supports...
Hybrid Algorithms for Matrix Multiplication on Multicore Clusters
Hybrid programming (through messages and shared memory) has gained importance since the appearance of multicore cluster architectures, fruit of the technological advance of processors and the physical limitations imposed by traditional architectures. This new programming paradigm allows exploiting the new memory hierarchy offered by the architecture. The purpose of this work is to carry out a c...
Fast recursive matrix multiplication for multi-core architectures
In this article, we present a fast algorithm for matrix multiplication optimized for recent multicore architectures. The implementation exploits different methodologies from parallel programming, like recursive decomposition, efficient low-level implementations of basic blocks, software prefetching, and task scheduling resulting in a multilevel algorithm with adaptive features. Measurements on ...
Effective Implementation of DGEMM on Modern Multicore CPU
In this paper we will present a detailed study on tuning double-precision matrix-matrix multiplication (DGEMM) on the Intel Xeon E5-2680 CPU. We selected an optimal algorithm from the instruction set perspective as well as software tools optimized for Intel Advanced Vector Extensions (AVX). Our optimizations included the use of vector memory operations and AVX instructions. Our proposed algorithm ...
Publication date: 2007